PARTIAL STEPWISE REGRESSION FOR DATA MINING 



Field of the Invention 

The present invention relates to computer technology in the area of data mining. More
particularly, the present invention relates to the data mining problem of predicting a given
dependent variable using a multitude of independent variables.

Description of Related Art 

Over the past two decades there has been a huge increase in the amount of data being stored in 
databases as well as the number of database applications in business and the scientific domain. 
This explosion in the amount of electronically stored data was accelerated by the success of the 
relational model for storing data and the development and maturing of data retrieval and 
manipulation technologies. While technology for storing the data developed fast enough to keep 
up with the demand, little attention was paid to developing software for analyzing the data until 
recently, when companies realized that hidden within these masses of data was a resource that 
was being ignored. The huge amounts of stored data contain knowledge about a number of 
aspects of their business waiting to be harnessed and used for more effective business decision 
support. Database Management Systems used to manage these data sets at present only allow the 
user to access information explicitly present in the databases i.e. the data. The data stored in the 
database is only a small part of the 'iceberg of information' available from it. Contained implicitly 
within this data is knowledge about a number of aspects of their business waiting to be harnessed 
and used for more effective business decision support. This extraction of knowledge from large 
data sets is called Data Mining or Knowledge Discovery in databases and is defined as the 
non-trivial extraction of implicit, previously unknown and potentially useful information from 
data. The obvious benefits of Data Mining have resulted in a lot of resources being directed 
towards its development. 

Data mining involves the development of tools that analyze large databases to extract useful 
information from them. As an application of data mining, customer purchasing patterns may be 
derived from a large customer transaction database by analyzing its transaction records. Such
purchasing habits can provide invaluable marketing information. For example, retailers can
create more effective store displays and more effective inventory control than otherwise would be 
possible if they know consumer purchase patterns. As a further example, catalog companies can 
conduct more effective mass mailings if they know that, given that a consumer has purchased a 
first item, the same consumer can be expected, with some degree of probability, to purchase a
particular second item within a particular time period after the first purchase. 

An important problem within this area of technology is the problem of predicting a given 
dependent data mining variable using a multitude of independent data mining variables. 

Typically a multitude of records is available representing expressions of a particular, yet
unknown functional relationship between the independent variables and a dependent data
variable. The difficulty is to use said multitude of records as a training set for deriving the
unknown functional relationship between the independent and the dependent variables, which
can serve as a prediction model. The prediction model can then be used to determine, for all
possible values of the independent variables, the value of the dependent variable. On the one
hand the prediction model should be able to reproduce the values of the training set as well as
possible; on the other hand the prediction model allows one to determine, and thus to predict,
the value of the dependent variable for value combinations of the independent variables not
comprised by the training set.

The most common prediction methodologies in data mining are the multiple linear regression
approach and the so-called multiple polynomial regression approach, with the former
representing a special case of the latter. Multiple linear regression uses one and the same
continuous function for each independent variable, based on the assumption of a linear
relationship between the dependent variable and one or more independent variables. Multiple
polynomial regression assumes a certain polynomial relationship between the dependent and the
independent variables and thus can be viewed as an extension of the multiple linear regression
methodology. It fits and approximates a given dependent variable y with n independent
variables $X_i$ based on a fitting model that uses a polynomial of a certain predefined degree m
common to all independent variables. Thus the assumed polynomial degree for each independent
variable is identical.

A simplified multiple polynomial regression model according to the state of the art is of the
following form:

$$y = A + \sum_{i=1}^{n} f(X_i) + \mathrm{Error}$$ (eq. 1)

where $f(X_i) = B_1 X_i + B_2 X_i^2 + \ldots + B_m X_i^m$, y is the dependent variable, $X_i$ are the
independent variables, and A and $B_1, \ldots, B_m$ denote the unknown coefficients.

So-called "linkage terms" of the independent variables (for instance, the linkage terms for 2
independent variables and a polynomial degree of 3 are $X_1 X_2$, $X_1^2 X_2$ and $X_1 X_2^2$) are
omitted for simplicity.



Both of the above-mentioned methodologies suffer from deficiencies relating to the same cause.
The actual polynomial degree of the functional relationship between the dependent and the
independent variables is not known, yet both methods are based on a fixed assumed functional
relationship: the multiple linear regression approach assumes a linear functional relationship,
and the multiple polynomial regression approach assumes a functional relationship of a common
predefined polynomial of degree m for all independent variables. Their results are therefore not
satisfactory. Prediction models based on a multiple linear regression approach fail to predict
non-linear functional relationships in an acceptable manner. Prediction models based on a
multiple polynomial regression approach generally generate unsatisfactory results if the
assumed common polynomial degree m deviates from the actual functional relationship. In these
cases the prediction (of an actual functional relationship of polynomial degree K
approximated by a polynomial of degree m) fits well only in a close neighborhood of the trained
data areas. This deficiency grows as the degree m of the fitting polynomials increases.

Moreover, the multiple polynomial regression approach is characterized by poor computational
efficiency. It always has to compute polynomials for the independent variables with an assumed
maximum polynomial degree m, even if the functional relationship is of a simpler form (for
instance linear); both the process of determining said polynomials and the process of
evaluating them suffer from this drawback.

Summary of the Invention 

The present invention teaches a computerized data mining method for determining a prediction
model for a dependent data mining variable based on a multitude of n independent data mining 
variables. 

The method comprises a variable replacement step replacing an independent data mining
variable with potential values from a global range by a multitude of independent local data
mining variables. Each independent local data mining variable relates to potential values from a
sub-range of the global range.

The method further comprises an initialization step initializing a current prediction model, and a
looping sequence. The looping sequence comprises a first step of determining, for every
independent local data mining variable not yet reflected in said current prediction model, a
multitude of partial regression functions (each depending on only one of the independent local
data mining variables); of determining for each of the partial regression functions a significance
value; and of selecting the most significant partial regression function and the corresponding not
yet reflected local data mining variable. The looping sequence comprises a second step of adding
said most significant partial regression function to the current prediction model.

The proposed method achieves improved prediction quality within the training data areas without
losing the stability for the untrained data areas. Moreover, the proposed teaching of the current
invention is able to detect different functional dependencies of the dependent data mining
variable within different sub-ranges of an independent data mining variable by introducing a
multitude of local data mining variables. This capability is most important for coping with
dependent variables occurring in practical applications, which are very often characterized by
varying functional dependencies within the individual sub-ranges of the independent variables.

Brief Description of the Drawings



Figure 1 is a flow chart depicting in more detail the mode of operation of the partial stepwise
polynomial regression methodology in a preferred embodiment of the invention based on the
calculation of specific regression functions, namely regression polynomials.

Figure 2 visualizes the calculated prediction models according to various approaches (including
that of the current invention) in comparison to the training data by indicating the differences
between the values of the training data and the calculated values according to the prediction
model (residuals).



Figure 3 shows for a certain example a comparison of the prediction models determined 
according to various variants of the Partial Stepwise Polynomial Regression methodology. 

Description of the Preferred Embodiment

In the drawings and specification there has been set forth a preferred embodiment of the
invention and, although specific terms are used, the description thus given uses terminology in a
generic and descriptive sense only and not for purposes of limitation. It will, however, be evident
that various modifications and changes may be made thereto without departing from the broader
spirit and scope of the invention as set forth in the appended claims.

The present invention can be realized in hardware, software, or a combination of hardware and
software. Any kind of computer system - or other apparatus adapted for carrying out the methods
described herein - is suited. A typical combination of hardware and software could be a general
purpose computer system with a computer program that, when being loaded and executed,
controls the computer system such that it carries out the methods described herein. The present
invention can also be embedded in a computer program product, which comprises all the features
enabling the implementation of the methods described herein, and which - when being loaded in
a computer system - is able to carry out these methods.

Computer program means or a computer program in the present context mean any expression, in
any language, code or notation, of a set of instructions intended to cause a system having an
information processing capability to perform a particular function either directly or after either or
both of the following: a) conversion to another language, code or notation; b) reproduction in a
different material form.



The current invention uses the term "data mining" in its most general meaning, generically
referring to all types of mining technologies, which are sometimes further distinguished as "data
mining" and "text mining".



1.1 The Stepwise Polynomial Regression Methodology

A significant improvement has been made available by a commonly-assigned patent application, 
Stepwise Polynomial Regression, U.S. Ser. No. 09/608,468, which is incorporated herein by 
reference. 



Stepwise Polynomial Regression uses a set of polynomials, but for each independent variable 
the 'best' fitting polynomial is selected. The 'best' fitting polynomial shows the least squared 
error of the predicted and the observed values from the dependent variable. Actually the best 
fitting polynomial is determined according to a significance measure. The regression formula of 
Stepwise Polynomial Regression is: 



$$y = A + \sum_{i=1}^{n} f(X_i) + \mathrm{Error}$$ (eq. 2)

where, for each $X_i$, $f(X_i)$ is selected as one of

$$f(X_i) = \begin{cases} B_{i1} X_i \\ B_{i1} X_i + B_{i2} X_i^2 \\ \ldots \\ B_{i1} X_i + B_{i2} X_i^2 + \ldots + B_{iM} X_i^M \end{cases}$$

and where y is the dependent variable, $X_i$ are the n independent variables, A and $B_{ij}$ denote the
unknown coefficients, and M is the maximum degree for the potential regression polynomials; from
within this multitude of potential regression polynomials the most significant one is selected to be
used as the actual prediction model.

In spite of all improvements provided by this technology, predictors determined according to
stepwise polynomial regression suffer from the deficiency that a functional relationship can be
described only by one of the provided continuous functions. In reality there are situations where
a dependent variable has to be described by different functional dependencies in different areas
of the overall range of the independent variables. The above-mentioned teaching would be able to
generate only a compromise solution as a prediction model. Moreover, noncontinuous
relationships or functional relationships which are not part of the provided set of regression
polynomials can only be approximated in a limited manner. In addition, if the maximum
polynomial degree is increased in the Stepwise Polynomial Regression approach to widen the
spectrum of potential functional dependencies, it can be observed that this decreases the stability
of the prediction model for data which are not in the neighborhood of the trained data areas.

Thus it can be observed that even the current Stepwise Polynomial Regression methodology 
requires further improvements. 

1.2 The Partial Stepwise Polynomial Regression Methodology

For an explanation of a further embodiment of the current invention relating to the specific
technology of calculating a regression function, the discussion focuses temporarily on
Fig. 1. Fig. 1 is a flow chart depicting in more detail how, according to a preferred embodiment
of the invention, a regression function is calculated by an iteration process. As a first
approach, Fig. 1 concentrates on the calculation of specific regression functions, namely regression
polynomials. In the next chapter an extension to general regression functions will be described.

The current invention proposes a new methodology for determining a prediction model for data 
mining variables which is called Partial Stepwise Polynomial Regression. 

The invention proposes two major approaches: partial stepwise polynomial regression using
a fixed number of ranges or regions for each independent variable, and a dynamic approach
using a variable number of ranges or regions for each independent variable. The fixed number
approach does not result in region intervals with the same length. In contrast to that it is
explicitly noted that the variable region number approach will result in variable region intervals
and a variable number of regions.

1.2.1 The Partial Stepwise Polynomial Regression Methodology With A Fixed Number of 
Regions 

Partial Stepwise Polynomial Regression with a fixed number of regions is an enhanced 
regression method further improving the stepwise polynomial regression approach. Stepwise 
polynomial regression predicts the value of the dependent variable on the basis of n independent 
variables, where each independent variable is expressed by a single continuous polynomial with 
its individual polynomial degree over the complete range of each independent variable. 

According to a fundamental observation leading to the current invention, in
practical situations the functional dependency of a dependent variable on a set of independent
variables is very often of a nature that does not allow that dependent variable to be approximated,
for prediction purposes, by a prediction model which is identical throughout the whole range of the
values of the independent variables. This situation has a multitude of causes. For
instance, within a first sub-range of an independent variable the functional dependency on the
dependent variable may be of a first type, while in another sub-range the functional dependency
may be of a second type. It is also possible that the functional dependency shows
discontinuities. For these reasons a dependent data mining variable within typical applications
cannot be approximated by a prediction model which is identical throughout the whole range of
values of the independent variables.

Therefore a methodology is suggested which is able to
provide multiple partial regression functions for each independent variable. Instead of using an
identical analytical function for the complete data area, that is the complete range, of an
independent variable, the data area (range) is split up into regions or sub-ranges. Each
region/sub-range contributes its own continuous prediction function to the overall prediction
model. This results in a prediction function for each independent variable which is determined by
multiple different analytical functions. Each region/sub-range has its own prediction function to
describe the relationship for the specific data area. Expressed in other words, an independent
data-mining variable with potential values from a global-range is replaced by a multitude of
independent local-data-mining-variables, each independent local-data-mining-variable with
potential values from a sub-range of said global-range.

The above teaching may not be merely reduced to modeling a functional dependency on a global 
range of the independent variables by simply breaking it up into a multitude of prediction models 
dedicated to the individual sub-ranges of the global-range. As the individual sub-ranges of the 
independent variables are represented by independent local-data-mining-variables participating 
within the iteration process described below, each individual sub-range of each individual 
independent data mining variable with its associated prediction model is competing 
independently from one another for recognition within the selection process based on its 
significance to the overall prediction model. This results in an optimization approach going far 
beyond the normal stepwise polynomial regression methodology. 

Thus Partial Stepwise Polynomial Regression suggests the use of multiple polynomials of 
potentially different degrees for each independent variable. Each polynomial describes a partial 
data area - called region or sub-range - of an independent variable. The invention proposes a 
technique allowing to determine individually a partial relationship for each region of an 
independent variable. The partial regression polynomials determined along these lines are 
combined to form the overall prediction model for the independent variable. 

Assuming a maximum degree M for the potential regression polynomials and H regions for each 
independent variable, the Partial Stepwise Polynomial Regression method may be expressed with 
the following formula: 



$$y = A + \sum_{i=1}^{n} f(X_i) + \mathrm{Error}$$ (eq. 3)

where

$$f(X_i) = \sum_{k=1}^{H} f(Z_{ik})$$

$$Z_{ik} = \begin{cases} X_i & \text{if } reg_{i,k-1} < X_i \le reg_{ik} \\ 0 & \text{else} \end{cases}$$

and where y is the dependent variable, $X_i$ are the independent variables, k is the index number of
the regions, and $Z_{ik}$ are the independent variables of specific regions, where $reg_{ik}$ determines
the upper border of region k.

For each $Z_{ik}$, $f(Z_{ik})$ is selected as one of

$$f(Z_{ik}) = \begin{cases} B_{ik1} Z_{ik} \\ B_{ik1} Z_{ik} + B_{ik2} Z_{ik}^2 \\ \ldots \\ B_{ik1} Z_{ik} + B_{ik2} Z_{ik}^2 + \ldots + B_{ikM} Z_{ik}^M \end{cases}$$

where A and $B_{ikj}$ denote the unknown coefficients.
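
The variable replacement behind eq. 3 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: region k is taken to cover the half-open interval between its lower border (exclusive) and its upper border (inclusive), the first border is minus infinity and the last is plus infinity, and the function name `local_variables` and the list layout of the borders are hypothetical choices not taken from the specification.

```python
import math


def local_variables(x, borders):
    """Split one value x of X_i into the H local values Z_i1 .. Z_iH.

    borders = [reg_i0, reg_i1, ..., reg_iH] (assumed layout).
    Region k is assumed to cover reg_i(k-1) < x <= reg_ik.
    Exactly one entry of the result equals x; all others are 0.
    """
    z = []
    for k in range(1, len(borders)):
        inside = borders[k - 1] < x <= borders[k]
        z.append(x if inside else 0.0)
    return z


# Hypothetical borders for H = 3 regions.
borders = [-math.inf, 0.0, 10.0, math.inf]
print(local_variables(5.0, borders))   # [0.0, 5.0, 0.0]
print(local_variables(-2.0, borders))  # [-2.0, 0.0, 0.0]
```

Because each record contributes a non-zero value to only one local variable per independent variable, each region can later be fitted with its own polynomial.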

Partial Stepwise Polynomial Regression determines the region-specific regression polynomials, i.e.
the fitting curve (eq. 3), in such a manner that for each region a specific regression polynomial
is selected from all potential regression polynomials up to a maximum degree M. The
combination of all region-specific polynomials determines the regression function of an
independent variable. The sum of the regression functions of all independent variables is the
regression function for the dependent variable.

Fig. 1 visualizes a summarizing overview on the proposed Partial Stepwise Polynomial 
Regression methodology by a flow diagram. 

Partial Stepwise Polynomial Regression starts by setting the maximum polynomial degree M
(step 101). This step limits the set of regression polynomials from which the method selects the most
significant one.

Within step 102 the global-ranges for the independent data-mining-variables $X_i$ are set.

In step 103 the number of regions/sub-ranges H is set and the H regions themselves are
determined for each independent variable. This step can be viewed as a
variable-replacement-step replacing the independent data-mining variable with potential values
from a global-range by a multitude of new independent local-data-mining-variables, each
independent local-data-mining-variable with potential values from a sub-range of said
global-range. In "ex post" observation with the current invention it turned out that, in addition to
the number of regions H, the determination of the regions themselves has the greatest impact on
the prediction quality of the finally determined prediction model.

Within an initialization step 104 a current prediction model is initialized. In most cases the
method starts with an empty prediction model, which means that none of the regions of the
independent variables is represented by a regression polynomial in its functional relationship
with the dependent variable.

Steps 105 to 108 represent a looping sequence which can be repeated until a certain
termination criterion is fulfilled. Within step 105 the method determines whether all independent
candidate regions, i.e. the sub-ranges, have already been reflected in the current version of the
prediction model. This can be achieved by simply counting the number of already reflected
regions/sub-ranges. Within step 106, for every region/sub-range not yet reflected in the prediction
model, a multitude of regression polynomials with different polynomial degrees is determined
based on the set of training data. In the most far-reaching embodiment of the invention, for every
not yet reflected region all regression polynomials according to eq. 3 of all degrees up to the
maximum degree M are determined. Next, for each of said candidate polynomials its
significance value is determined. The significance measures, based on the set of training data, the
degree of improvement of the current prediction model if a regression polynomial were
added to the prediction model. The significance is thus a measure of the appropriateness of a
regression polynomial to reflect the functional relationship with the dependent variable, i.e.
showing the "closest" relationship with the dependent variable. Within step 107 the most
significant potential regression polynomial according to this significance measure and its
corresponding region are then selected and added to the current prediction model, thus reflecting
said region of an independent variable within the prediction model.

Within step 108 the method checks the fulfillment of a termination criterion. According to a
basic implementation of the invention the termination criterion is a test whether all regions of all
independent variables have been included in the prediction model by contributing a regression
polynomial. The final prediction model at the point in time when the method terminates
represents the prediction model as determined by the invention.
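
The looping sequence of steps 104 to 108 can be sketched in Python as a greedy forward selection over regions. This is a simplified illustration, not the method as specified: the constant A is fixed to the mean of the training values, each candidate polynomial is fitted to the current residuals instead of refitting the whole model, and an F-like variance ratio stands in for the significance measure. All function names (`solve`, `poly`, `fit_poly`, `partial_stepwise`) and the sample data are hypothetical.

```python
def solve(a, b):
    """Solve the small linear system a * x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None  # singular system: this candidate cannot be fitted
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def poly(coef, v):
    """Evaluate B_1*v + B_2*v^2 + ... (no constant term, as in eq. 3)."""
    return sum(c * v ** (j + 1) for j, c in enumerate(coef))


def fit_poly(z, target, degree):
    """Least-squares coefficients of a polynomial of the given degree in z."""
    powers = range(1, degree + 1)
    a = [[sum(v ** (p + q) for v in z) for q in powers] for p in powers]
    b = [sum(t * v ** p for v, t in zip(z, target)) for p in powers]
    return solve(a, b)


def partial_stepwise(columns, y, max_degree):
    """columns maps a region name to its Z_ik values (0 outside the region)."""
    n = len(y)
    intercept = sum(y) / n                       # simplification: A is fixed
    resid = [yi - intercept for yi in y]
    model, remaining = {}, set(columns)          # step 104: empty model
    while remaining:                             # steps 105 and 108
        best = None
        sse0 = sum(r * r for r in resid)
        for name in sorted(remaining):           # step 106: all candidates
            z = columns[name]
            for d in range(1, max_degree + 1):
                coef = fit_poly(z, resid, d)
                if coef is None:
                    continue
                sse1 = sum((r - poly(coef, v)) ** 2 for v, r in zip(z, resid))
                noise = max(sse1 / max(n - d - 1, 1), 1e-12)
                score = (sse0 - sse1) / d / noise  # F-like stand-in
                if best is None or score > best[0]:
                    best = (score, name, coef)
        if best is None:
            break
        _, name, coef = best                     # step 107: add the winner
        model[name] = coef
        remaining.discard(name)
        resid = [r - poly(coef, v) for v, r in zip(columns[name], resid)]
    return intercept, model


# Hypothetical training data: a different law in each of two regions.
x = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
y = [2 * v if v < 0 else v * v for v in x]
columns = {
    "Z_11": [v if v <= 0 else 0.0 for v in x],
    "Z_12": [v if v > 0 else 0.0 for v in x],
}
intercept, model = partial_stepwise(columns, y, max_degree=2)
print(intercept, model)
```

The sketch terminates when every region has contributed a polynomial, which corresponds to the basic termination criterion of step 108. A closer implementation would refit all coefficients jointly and rank candidates by the F-test probability.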



It is important to recognize that the suggested methodology is able to remove the constraint that a
regression polynomial of an independent variable must have a higher polynomial degree to
describe a more complex relationship. Moreover, the invention even allows one to describe
non-polynomial relationships, for instance a Gaussian relationship.

1.2.2 The Region Determination 

The region borders of the i'th independent variable are defined as 



$$reg_{ik} = \begin{cases} -\infty & k < 1 \\ C_i - \frac{dI_i}{2} + \frac{dI_i}{H} \cdot k & k \ge 1,\ k < H \\ \infty & k = H \end{cases}$$ (eq. 4)

where H defines the maximum number of regions, $dI_i$ determines the definition interval for the
region borders, i.e. the global range, and $C_i$ is a specific center of the i'th independent variable.
Various methods to define such a center are possible; one particularly successful, concrete
definition for the center is given below.

A specific implementation of a region determination according to the current invention is based 
on the following features. 

The most significant measures describing distributions of data are the mean value and the 
standard deviation. 



The mean value is defined as

$$X_{Mean} = \frac{1}{n} \sum_{j=1}^{n} X_j$$ (eq. 5)

where n is the number of observations. The standard deviation is the square root of the empirical
variance, $s = \sqrt{S}$, with the empirical variance

$$S = \frac{1}{n-1} \sum_{j=1}^{n} (X_j - X_{Mean})^2$$ (eq. 6)

Assuming the mean value of the i'th independent variable as center $C_i$ and the definition interval
$dI_i$ for the region borders as four times its standard deviation $s_i$, this results in a specific region
determination of

$$reg_{ik} = \begin{cases} -\infty & k < 1 \\ X_{i,Mean} - 2 s_i + \frac{4 s_i}{H} \cdot k & k \ge 1,\ k < H \\ \infty & k = H \end{cases}$$ (eq. 4.1)
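
The region determination around the mean value can be sketched in Python as follows. The sketch assumes that the inner borders are spaced evenly over an interval of four standard deviations centered on the mean, that the sample standard deviation (divisor n - 1) is used, and that the outermost regions extend to infinity; the function name `region_borders` and the sample values are hypothetical.

```python
import math
import statistics


def region_borders(values, h):
    """Return the borders [reg_0, ..., reg_H] for one independent variable.

    Assumptions: center = mean, definition interval = 4 * standard
    deviation, inner borders spaced evenly by (interval / h).
    """
    center = statistics.mean(values)
    s = statistics.stdev(values)        # sample standard deviation (n - 1)
    interval = 4.0 * s                  # definition interval
    borders = [-math.inf]               # border below the first region
    for k in range(1, h):
        borders.append(center - interval / 2.0 + interval / h * k)
    borders.append(math.inf)            # upper border of the last region
    return borders


# Hypothetical training values of one independent variable, H = 4 regions.
borders = region_borders([1.0, 2.0, 3.0, 4.0, 5.0], h=4)
print(borders)
```

With an even number of regions, the middle border coincides with the mean, so the regions are placed symmetrically around the center of the data.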



1.2.3 The Significance Measure 

As a first observation the significance measure of a regression polynomial for a region of an 
independent variable is reduced to the simpler problem of a significance measure of the 
individual powers of a regression polynomial. 

According to a preferred implementation the significance of a polynomial is defined as the minimum
significance of all its linear predictors:

$$B_{ik1} Z_{ik},\ B_{ik2} Z_{ik}^2,\ \ldots,\ B_{ikM} Z_{ik}^M$$ (eq. 7)

Thus the significance of a regression polynomial is determined by the smallest significance of
any of its powers.

Starting from this definition a significance measure for the linear predictors is required. For the
linear predictors, the invention suggests exploiting the F-test to test whether a predictor influences
the dependent variable or not. The F-test is a statistical test, well known in statistics, that
checks whether two estimates of the variance of two independent samples are the same. In
addition, the F-test checks whether the so-called NULL hypothesis is true or false. Applied
to the current situation, and assuming the inverse hypothesis that "a predictor has no influence on the
dependent variable", this leads to the following NULL hypothesis for the F-test: "a coefficient
$B_{ikj}$ in a linear regression model (with respect to the various $Z_{ik}^j$) is zero".



For a single linear predictor the test statistic has $(n - 2)$ degrees of freedom and is computed
from $S_{Z_{ik}Y}$, the empirical covariance between $Z_{ik}$ (a region of an independent variable)
and Y (the dependent variable), and $S_{Z_{ik}}$, the empirical variance of the region variable
$Z_{ik}$. In this special case the T-statistic t is equal to the root of the F-statistic. This remark
indicates that other statistical measures (like the T-test) could also be used as the foundation
for the significance measure approach of the current invention.
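
As an illustration of a test statistic for a single linear predictor, the following Python sketch computes the textbook F statistic for the slope of a simple linear regression with intercept, F = (n - 2) * r^2 / (1 - r^2), which has 1 and n - 2 degrees of freedom and equals the square of the corresponding t statistic. This is the standard form and is not necessarily the exact expression used in the specification; the function name `f_statistic` and the sample data are hypothetical.

```python
def f_statistic(z, y):
    """Textbook F statistic for the slope in y = a + b*z (1 and n-2 d.o.f.).

    Built from the empirical covariance of z and y and the empirical
    variances of z and y. Undefined if the fit is perfect (r^2 == 1)
    or if either variable is constant.
    """
    n = len(z)
    z_mean = sum(z) / n
    y_mean = sum(y) / n
    s_zy = sum((a - z_mean) * (b - y_mean) for a, b in zip(z, y)) / (n - 1)
    s_z = sum((a - z_mean) ** 2 for a in z) / (n - 1)
    s_y = sum((b - y_mean) ** 2 for b in y) / (n - 1)
    r2 = s_zy ** 2 / (s_z * s_y)        # squared correlation coefficient
    return (n - 2) * r2 / (1.0 - r2)


# Hypothetical data: a nearly linear relation and an unrelated one.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
strong = f_statistic(z, [2.1, 3.9, 6.2, 7.8, 10.1])
weak = f_statistic(z, [1.0, -1.0, 1.0, -1.0, 1.0])
print(strong, weak)
```

A large F value corresponds to a small probability of obtaining a larger F value under the NULL hypothesis, and hence to a highly significant predictor.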

From the perspective of a practical computation, the F-test value of a certain regression
polynomial is determined on the basis of tentatively adding the regression
polynomial to the current prediction model and on the basis of the training data set.

Based on this calculated F-test value the probability of obtaining a larger F-test value
(Probability > F) can be determined according to the statistical theory of the F-test. If this
probability tends to zero there is statistical evidence for rejecting the NULL hypothesis. In other
words: the more the (Probability > F) value approaches the value of 1, the larger is the support
that the NULL hypothesis is true, indicating a small significance of the corresponding linear
predictor (power of the regression polynomial); vice versa: the more the (Probability > F) value
approaches the value of 0, the larger is the support that the NULL hypothesis is false, indicating
a large significance of the corresponding linear predictor.

Thus the invention proposes to use the (Probability > F) value based on the F-test theory as the
significance measure of a linear predictor.

1.2.4 Partial Stepwise Polynomial Regression Conditionally Adding Regions 
(Sub-Ranges) of Independent Variables 

Partial Stepwise Polynomial Regression allows for several optimization strategies of the
methodology depending on the particular objective. The proposed improvements are targeted at
reducing the number of regions (sub-ranges) of the independent variables which contribute to the
prediction model. Stated in other terms, the improvements of the method will not reflect all of
the possible regions (sub-ranges) within the prediction model and will limit the number of
regions to those which contribute to a "larger degree" to the functional relationship with the
dependent variable. The coefficients of the eliminated regions are zero.

A first improvement of Partial Stepwise Polynomial Regression will add regions to the set of
regions reflected in the prediction model only conditionally. This first improvement exploits
the so-called adjusted R square measure, also called the adjusted correlation coefficient. The
adjusted R square measure is well known within the state of the art.

This first improvement results in an enhanced step 107 of Fig. 1. Instead of unconditionally
adding the most significant regression polynomial to the prediction model, it is first determined
if its inclusion would improve the adjusted correlation coefficient of the resulting
prediction-model with respect to the set of training data. Only in the affirmative case are that
regression polynomial and the corresponding region added to the prediction model. Otherwise
the corresponding region is excluded from said method without further participation in the
iteration process.

More particularly, if step 106 indicates the most significant regression polynomial and its
corresponding region, and if this region is the j-th region to be added to the prediction model, the
selection criterion for actually adding this region to the prediction model is:

$$R'^2_j > R'^2_{j-1}$$ (eq. 8)

where the adjusted R square coefficient $R'$ for linear predictors is:

$$R'^2 = 1 - \frac{q-1}{q-p-1}\,(1 - R^2)$$ (eq. 9)

where $R^2$ is the squared correlation coefficient with respect to the fitted and observed values,
q is the number of observations (i.e. the number of training records), and p is the number of
independent predictors comprised by the regression polynomials within the current prediction
model. In other words, the number of independent predictors p is equal to the number of the
unknown coefficients $B_{ikm}$.

The correlation coefficient R is calculated by dividing the covariance of the observed (i.e.
according to the training data) and the predicted values by the product of the standard deviation
of the observed values and the standard deviation of the predicted values. Therefore

$$R = \frac{S_{YY'}}{S_Y\,S_{Y'}}$$ (eq. 10)

where $S_{YY'}$ is the empirical covariance of Y and Y', which is determined by

$$S_{YY'} = \frac{\sum_{i=1}^{q} (Y_i - Y_{Mean})\,(Y'_i - Y'_{Mean})}{q-1}$$ (eq. 11)

and where $Y_i$ are the observed values of the dependent variable and $Y'_i$ are the predicted values.
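
The adjusted R square test of this section can be illustrated with a minimal Python sketch. It assumes that the criterion of (eq. 8) is "add the region only if the adjusted R square increases", as stated in the prose above; the function names are hypothetical and not part of the described method.

```python
from statistics import mean


def squared_correlation(observed, predicted):
    """R^2 between observed and predicted values (eq. 10 and eq. 11)."""
    q = len(observed)
    y_mean, p_mean = mean(observed), mean(predicted)
    # Empirical covariance and variances, all with the (q - 1) denominator.
    cov = sum((y - y_mean) * (p - p_mean)
              for y, p in zip(observed, predicted)) / (q - 1)
    var_y = sum((y - y_mean) ** 2 for y in observed) / (q - 1)
    var_p = sum((p - p_mean) ** 2 for p in predicted) / (q - 1)
    return cov ** 2 / (var_y * var_p)


def adjusted_r_squared(r2, q, p):
    """Adjusted R^2 for q observations and p predictors (eq. 9)."""
    return 1.0 - (q - 1) / (q - p - 1) * (1.0 - r2)


def accept_region(r2_before, p_before, r2_after, p_after, q):
    """Assumed reading of eq. 8: accept only if the adjusted R^2 improves."""
    return (adjusted_r_squared(r2_after, q, p_after)
            > adjusted_r_squared(r2_before, q, p_before))
```

A region whose polynomial adds three coefficients but raises R square only from 0.900 to 0.901 on 20 training records would be rejected under this reading, because the penalty for the additional predictors outweighs the gain.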

As a result the Partial Stepwise Polynomial Regression with adjusted R square optimization
eliminates all regions which do not improve the adjusted correlation coefficient, i.e. which do not
improve the prediction model. As an important computational advantage this results in a
prediction model requiring a smaller number of regions. Only those regions which improve the
prediction quality will become part of the prediction model.

1.2.5 Partial Stepwise Polynomial Regression Conditionally Adding and Removing Regions

A second improvement of Partial Stepwise Polynomial Regression will likewise add regions to
the set of regions reflected in the prediction model only conditionally. Moreover, it will also
remove regions from the prediction model again under certain conditions. Thus the second
improvement is targeted at determining a prediction model with as few regions as possible.

This second improvement results in an enhanced step 106 of Fig. 1. Instead of unconditionally
adding the most significant regression polynomial to the prediction model, it is first determined
whether the significance of the currently most significant regression polynomial is above a
predefined threshold significance value. Only in the affirmative case is said currently most
significant polynomial added to the prediction model. Additionally, this second improvement of
the invention enhances the looping sequence reflected in Fig. 1 by a third step succeeding step 107.
Within this new step it is determined whether the significance of a certain regression polynomial
(or of a multitude of regression polynomials) comprised within the current prediction model is
reduced after the last regression polynomial has been added to the prediction model. If this is the
case, said certain regression polynomial together with its corresponding region is removed from
the current prediction model. Though this region is no longer reflected in the prediction model, it
may of course participate in the further iteration process; i.e. a removed region can be added
again in one of the next steps of the iteration. An alternative handling is to exclude a region
which has once been removed from the prediction model from said method without further
participation in the iteration process.


These steps (adding and removing regions) are repeated until all regions whose significance is
higher than the specified threshold significance value are added to the model. This algorithm is
called stepwise regression with full forward (adding regions) and backward (removing regions)
capabilities. Expressing the termination criterion in other words, the looping sequence is
terminated if the significance of the currently most significant regression polynomial is below
said threshold significance.

With respect to the comparison of significance values it has to be stressed that the significance of
a variable is higher the closer its (Probability > F) value is to zero. That means a variable is added
when its (Probability > F) value is lower than the given significance threshold.

As a result of this second improvement, the suggested methodology makes it possible to find a
prediction model that minimizes the number of required regions. This in turn minimizes the
number of independent variables.
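
The forward and backward looping sequence of this section may be sketched in Python as follows. This is only an illustration under stated assumptions: `p_value` is a hypothetical callable standing in for the (Probability > F) computation, a region is removed when its value is no longer below the threshold once the latest region has been added, and removed regions are excluded from further participation (the alternative handling mentioned above).

```python
def stepwise_select(candidates, p_value, threshold):
    """Return the regions kept by a forward/backward stepwise loop.

    candidates: iterable of region identifiers.
    p_value:    hypothetical callable p_value(region, model) returning the
                (Probability > F) value of the region given the model.
    threshold:  significance threshold; smaller values are more significant.
    """
    model = []
    remaining = list(candidates)
    while remaining:
        # Forward step: the most significant candidate has the smallest value.
        best = min(remaining, key=lambda r: p_value(r, model))
        if p_value(best, model) >= threshold:
            break  # termination: no remaining candidate is significant enough
        remaining.remove(best)
        model.append(best)
        # Backward step: drop earlier regions that lost their significance.
        for region in [r for r in model if r != best]:
            others = [r for r in model if r != region]
            if p_value(region, others) >= threshold:
                model.remove(region)  # assumption: excluded for good
    return model
```

Excluding removed regions permanently guarantees that the loop terminates; allowing them to re-enter, as the text also permits, would require an additional safeguard against cycling.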

1.3 Extension of the Invention to Categorical Variables

Besides operating on data mining variables that are numeric in nature, the proposed methodology
can be enhanced to also handle so-called categorical variables. Categorical variables are
variables which can have a value out of a set of discrete values. A categorical variable will be
converted into a vector wherein the vector elements are treated like independent variables
representing the potential values of the categorical variable. Thus a categorical variable is treated
as a [0,1] vector of its potential values.

For instance the categorical variable X having n members, i.e. n vector elements
["member_1", "member_2", ..., "member_n"], is mapped to the independent variables (i.e.
predictors):

X("member_1") = [0,1]

X("member_2") = [0,1]

...

X("member_n") = [0,1]
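
As a small illustration of this mapping, the following Python sketch converts one categorical value into its 0/1 vector. The function name and the three-member example are assumptions made for the illustration only.

```python
def encode_categorical(value, members):
    """Map one categorical value to a 0/1 vector, one element per member."""
    return [1 if value == member else 0 for member in members]


members = ["member_1", "member_2", "member_3"]
encoded = encode_categorical("member_2", members)  # one predictor per member
```

Each element of the resulting vector can then be handled like any other independent variable.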

1.4 Example of a Prediction Model According to the Current Invention

In the following, a comparison of the results of linear regression, stepwise polynomial regression
and partial stepwise polynomial regression (according to the current invention) applied to the
same training data set is presented. Based on the training data set all regressions predict the
dependent (mining) variable "income/capital" from the three independent variables (the
concrete nature of these variables is not important for the current explanation):
[AGRI, SERVICE, INDUS]

All independent variables are numeric. The results of each prediction model are shown in Fig. 2.
Fig. 2 visualizes the fitting curve showing the residuals (that is, the deviation between the exact
values of the dependent variable and its value according to the prediction model) of the
observed values versus the predicted values.

The prediction model generated according to the linear regression method resulted in an $R^2$ of
0.634596, where $R^2$ is the squared correlation coefficient between observed and fitted values,
and the following prediction model:

income/capital = -4660.13211905839
    + 40.9309547592514 * AGRI
    + 59.239686679445 * SERVICE
    + 59.8879962521713 * INDUS

(sample 1)

The prediction model generated according to the Stepwise Polynomial Regression methodology
resulted in an $R^2$ of 0.796181 and the following prediction model:

income/capital = -11771.9127789895
    - 3.28336940177977 * AGRI^2
    + 0.0469395359133741 * AGRI^3
    + 287.254225684245 * INDUS
    - 7.66112617982519 * INDUS^2
    + 0.057519013287087 * INDUS^3
    + 1123.30302703306 * SERVICE
    - 35.2179332880916 * SERVICE^2
    + 0.346957993399851 * SERVICE^3

(sample 2)



The prediction model generated according to the Partial Stepwise Polynomial Regression
methodology resulted in an $R^2$ of 0.997138 and the following prediction model:

income/capital =
    406.178558348123
    + 26.337989486862 * (AGRI >= 7.145 and < 25.6)
    - 45.450827692626591 * (AGRI >= 7.145 and < 25.6)^2
    + 833.49621951468 * (AGRI >= 25.6 and < 44.0544)
    - 45.450827692626591 * (AGRI >= 25.6 and < 44.0544)^2
    + 0.600474152375959 * (AGRI >= 25.6 and < 44.0544)^3
    - 20.5338136610743 * (AGRI >= 44.054)
    + 14.5607462031416 * (INDUS < 31.580)
    + 444.293086313939 * (INDUS >= 43.35 and < 55.319)
    - 22.2991534042399 * (INDUS >= 43.35 and < 55.319)^2
    + 0.263576382781448 * (INDUS >= 43.35 and < 55.319)^3
    + 216.7300647299952 * (SERVICE < 22.607)
    - 9.0408896205051 * (SERVICE < 22.607)^2
    - 176.731141985718 * (SERVICE >= 31.15 and < 39.69)
    + 8.6463067407576 * (SERVICE >= 31.15 and < 39.69)^2
    - 0.0869526250292018 * (SERVICE >= 31.15 and < 39.69)^3
    - 419.782430750505 * (SERVICE >= 39.69)
    + 21.5409362796845 * (SERVICE >= 39.69)^2
    - 0.26503551689648 * (SERVICE >= 39.69)^3

(sample 3)






As can be seen from the comparison of the resulting prediction models visualized within Fig. 2,
the Partial Stepwise Polynomial Regression approach has been able to detect very precisely, for
certain regions (i.e. sub-ranges), the different functional relationships between the independent
data mining variables and the dependent data mining variable, which differ from sub-region to
sub-region. One example in this respect is the variable "AGRI"; in this case the Partial Stepwise
Polynomial Regression approach moreover denied a polynomial functional contribution for the
region (AGRI >= 44.054), in contrast to the prediction model generated by the Stepwise
Polynomial Regression approach.

1.5 Comparison of Certain Partial Stepwise Polynomial Regression Variants

Based on the training data set of the previous chapter, with the problem of determining a
prediction model for the dependent variable "income/capital" depending on the 3 independent
variables, the prediction models determined according to various variants of the Partial Stepwise
Polynomial Regression methodology are compared in Fig. 3. Fig. 3 shows the squared
correlation coefficient "R square" and the "number regions", i.e. the number of sub-ranges and
their corresponding independent local data mining variables selected as most appropriate for the
model, versus the following Partial Stepwise Polynomial Regression variants:

■ the standard Partial Stepwise Polynomial Regression method ("Standard")

■ the Partial Stepwise Polynomial Regression conditionally adding independent
variables based on the adjusted correlation coefficient ("Adjusted R square")

■ the Partial Stepwise Polynomial Regression conditionally adding and removing
independent variables based on the significance threshold of 0.3 ("Stepwise 0.3")

■ the Partial Stepwise Polynomial Regression conditionally adding and removing
independent variables based on the significance threshold of 0.4 ("Stepwise 0.4")


Referring to Fig. 3, the standard Partial Stepwise Polynomial Regression algorithm predicts
the dependent data mining variable "income/capital" with $R^2$ = 0.997138. Using the adjusted R
squared optimization decreases the prediction quality to $R^2$ = 0.912805 but requires only
9 regions (instead of 18). Finally, Partial Stepwise Polynomial Regression conditionally adding
and removing regions with a significance threshold of 0.3 results in a prediction quality of
$R^2$ = 0.998117 with 15 regions. By increasing the significance threshold from 0.3 to 0.4
(remember: an increase of the numerical value corresponds to a lowering of the significance level
of the threshold) the prediction quality slightly increases to $R^2$ = 0.998752, but the number of
regions also increases from 15 to 16.

Thus, the standard Partial Stepwise Polynomial Regression methodology may be further
improved in terms of the prediction model quality by the two variants. Most remarkable is that
both variants reduce the number of regions used, which in turn reduces the number of
independent variables; they differ significantly, however, in their usage. While adjusted R
squared optimization does not need any user knowledge, conditionally adding and removing
regions requires a deeper knowledge of the prediction result quality.

1.6 Partial Stepwise Polynomial Regression With Variable Number of Regions

Partial Stepwise Polynomial Regression with a variable number of regions is an enhanced
regression method improving on the partial stepwise polynomial regression approach with a fixed
number of regions. Partial Stepwise Polynomial Regression with a fixed number of regions
predicts the value of the dependent variable on the basis of n independent variables. Each
independent variable is expressed by a fixed number of polynomials, one polynomial per region.
The number of regions is fixed and each of them has a potentially different polynomial degree.
This further embodiment of the invention proposes an additional technique that allows the
number and sizes of the regions/sub-ranges to be determined for each independent variable
individually.

Assume a maximum degree M for the potential regression polynomials and $H_{max}$ as the
maximum number of regions for each independent variable (not a fixed number H as in the
previous embodiment); this also means that the final number of sub-regions may differ from one
independent variable to another. The Partial Stepwise Polynomial Regression method with a
variable number of regions may be expressed with the following formula:

$$y = A + \sum_{i=1}^{n} f_i(X_i) + Error$$ (eq. 12)

where

$$f_i(X_i) = \sum_{k=1}^{H_i} f_{ik}(Z_{ik})$$

$$Z_{ik} = \begin{cases} X_i & \text{if } reg_{i(k-1)} \le X_i < reg_{ik} \\ 0 & \text{else} \end{cases}$$

with $1 \le H_i \le H_{max}$

and where y is the dependent variable, $X_i$ are the independent variables, k is the index of a
region, $H_i$ is the number of regions for the i'th independent variable, and $Z_{ik}$ are the
independent variables of specific regions, where $reg_{ik}$ determines the upper border of region k.

For each $f_{ik}(Z_{ik})$ select one of

$$f_{ik}(Z_{ik}) = \begin{cases} B_{ik1} Z_{ik} \\ B_{ik1} Z_{ik} + B_{ik2} Z_{ik}^2 \\ \dots \\ B_{ik1} Z_{ik} + B_{ik2} Z_{ik}^2 + \dots + B_{ikM} Z_{ik}^M \end{cases}$$

where A and $B_{ikj}$ denote the unknown coefficients.
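
To make the structure of (eq. 12) concrete, the following Python sketch evaluates a prediction model built from region specific polynomials. It assumes half-open regions (lower border included, upper border excluded), as suggested by the notation of the samples such as (AGRI >= 7.145 and < 25.6); the data layout and the coefficients in the usage note are purely illustrative.

```python
def local_variable(x, lower, upper):
    """Z_ik: equals x inside [lower, upper), otherwise 0."""
    return x if lower <= x < upper else 0.0


def region_polynomial(z, coefficients):
    """B_ik1*z + B_ik2*z^2 + ...; the degree is len(coefficients) <= M."""
    return sum(b * z ** (j + 1) for j, b in enumerate(coefficients))


def predict(intercept, variables, x_values):
    """Evaluate y = A + sum_i sum_k f_ik(Z_ik).

    variables: {name: [(lower, upper, [B1, B2, ...]), ...]}  (assumed layout)
    x_values:  {name: value of that independent variable}
    """
    y = intercept
    for name, regions in variables.items():
        for lower, upper, coefficients in regions:
            z = local_variable(x_values[name], lower, upper)
            y += region_polynomial(z, coefficients)
    return y
```

For example, with `model = {"X1": [(float("-inf"), 10.0, [2.0]), (10.0, float("inf"), [1.0, 0.5])]}` and intercept 1.0, the value X1 = 4.0 falls into the first region and yields 9.0, whereas X1 = 12.0 falls into the second region and yields 85.0.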

Partial Stepwise Polynomial Regression determines the region specific regression polynomials,
i.e. the fitting curve (eq. 3), in such a manner that for each region a specific regression
polynomial is selected from all potential regression polynomials up to a maximum degree M. The
combination of all region specific polynomials determines the regression function of an
independent variable. The sum of the regression functions of all independent variables is the
regression function for the dependent variable.



1.6.1 Determination of Initial Regions/Sub-Ranges 

The initial region borders of the i'th independent variable are defined as

$$ireg_{ik} = \begin{cases} -\infty & k < 1 \\ C_i - \dfrac{dI_i}{2} + \dfrac{dI_i}{H}\,k & k \ge 1,\ k < H \\ \infty & k = H \end{cases}$$ (eq. 13)

where H defines the maximum number of regions, $dI_i$ determines the definition interval for the
initial region borders and $C_i$ is a specific center of the i'th independent variable.

A specific implementation of an initial region determination according to the current invention is
based on the following features.

The most significant measures describing distributions of data are the mean value and the
standard deviation.

The mean value is defined as

$$Z_{Mean} = \frac{1}{n} \sum_{j=1}^{n} Z_j$$ (eq. 14)

where $Z_j$ are the observed values and n is the number of observations. The standard deviation
is the square root of the empirical variance, $s = \sqrt{S}$, with the empirical variance:

$$S = \frac{\sum_{j=1}^{n} (Z_j - Z_{Mean})^2}{n-1}$$ (eq. 15)

Assuming the mean value of the i'th independent variable as center $C_i$ and the definition
interval $dI_i$ for the region borders as four times its standard deviation $s_i$, this results in a
specific initial region determination of

$$ireg_{ik} = \begin{cases} -\infty & k < 1 \\ Z_{i,Mean} - 2 s_i + \dfrac{4 s_i}{H}\,k & k \ge 1,\ k < H \\ \infty & k = H \end{cases}$$ (eq. 13.1)
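
A minimal Python sketch of this initial region determination is given below. It assumes that the inner borders are placed at center - dI/2 + (dI/H) * k for 1 <= k < H, with the mean as center and four standard deviations as definition interval; the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def initial_borders(values, h):
    """Initial borders ireg_0 .. ireg_h for one independent variable.

    The mean is used as center and four standard deviations as the
    definition interval; the outermost borders are -inf and +inf.
    """
    center = mean(values)
    interval = 4.0 * stdev(values)  # sample standard deviation, (n - 1) denominator
    inner = [center - interval / 2.0 + interval / h * k for k in range(1, h)]
    return [-math.inf] + inner + [math.inf]
```

With H = 4 this yields the borders -inf, mean - s, mean, mean + s and +inf, so that the two outer regions collect the observations lying more than one standard deviation away from the mean.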



1.6.2 Determination of the Final Regions/Sub-Regions 

The final region borders of the i'th independent variable are defined as

$$reg_{ik} = \text{select one of } \{\, ireg_{i(j+1)}, \dots, ireg_{iH} \,\}, \quad j = j' \text{ with } reg_{i(k-1)} = ireg_{ij'},\ k > 1$$ (eq. 15)

such that the interval $[reg_{i(k-1)}, reg_{ik}]$ contains at least $N_p$ disjoint observations of the
independent variable.



In verbal representation eq. 15 defines a method wherein the sub-ranges (regions) and the 
corresponding local-data-mining-variables Zik are of variable size determined by an iterative 
procedure comprising the following steps: 

a. an initial step of dividing the global-range into a maximum number H of equidistant
sub-ranges (regions), 

b. an iteration step of selecting a certain sub-range for which the number of the training 
data falling into a certain sub-range is below a third threshold and joining said certain 
sub-range with a neighbor sub-range forming a larger sub-range, and 

c. a termination step of terminating the iteration step if for each sub-range the number of 
the training data falling into said each sub-range is equal to or above the third threshold. 
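
Steps a. to c. can be sketched in Python as follows. The sketch makes several assumptions that the text leaves open: sub-ranges are half-open intervals, the leftmost sparse sub-range is treated first, it is joined with its right neighbor (or with its left neighbor if it is the last sub-range), and the outer sub-ranges reaching to infinity are subject to the same threshold.

```python
from bisect import bisect_left


def merge_sparse_regions(borders, values, min_count):
    """Join neighboring sub-ranges until each holds at least min_count values.

    borders: sorted borders [b_0, ..., b_H]; sub-range k is [b_k, b_(k+1)).
    values:  training values of one independent variable.
    """
    borders = list(borders)
    data = sorted(values)

    def count(k):
        # Number of training values falling into [borders[k], borders[k + 1]).
        return bisect_left(data, borders[k + 1]) - bisect_left(data, borders[k])

    while len(borders) > 2:
        sparse = [k for k in range(len(borders) - 1) if count(k) < min_count]
        if not sparse:
            break  # termination step: every sub-range is populated enough
        k = sparse[0]
        # Remove the border shared with the right neighbor; the last
        # sub-range has no right neighbor and is joined with the left one.
        del borders[k + 1 if k + 1 < len(borders) - 1 else k]
    return borders
```

Because every iteration removes one border, the loop ends after at most H - 1 joins, in the extreme case with a single sub-range covering all values.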

This iterative approach to determining the number and sizes of the sub-ranges results in a
situation wherein each sub-range is, on the one hand, large enough to comprise enough
training information for calculating a corresponding prediction model and, on the other hand,
small enough to take into consideration different functional dependencies in different value
ranges of the independent variables.

Further attention should be given to eq. 13.1 with respect to the specific treatment of the two
sub-ranges between -∞ and the lower limit of the global-range as well as between the upper limit
of the global-range and +∞. As can be seen from this equation, it is suggested that the
local-data-mining-variables are augmented by the following sub-ranges and corresponding
independent local-data-mining-variables:

1. a local-data-mining-variable representing a sub-range from -∞ up to the lower limit of
the global-range; and/or

2. a local-data-mining-variable representing a sub-range from the upper limit of the
global-range up to +∞.

The advantage of such an approach is that these specific sub-ranges, which comprise only a
limited number of training data (recall that, due to the specific construction of the
global-range based on the mean value and the standard deviation of the values of the independent
variable as explained above, most training data belong to the global-range), can nevertheless
participate in a "lump-sum" treatment in the proposed determination procedures. Even though
little knowledge of the functional dependency is available in these areas (due to the limited
number of training data), the proposed teaching will provide stable prediction models in these
sub-ranges.



1.6.3 Example of a Prediction Model Using Variable Number of Regions 

In the following a comparison of the result of partial stepwise polynomial regression using a 
fixed number of regions and of the result of using a variable number of regions is presented. The 
regression model of the fixed number approach is shown in (sample 3) above. 



Using the same training data for the variable number of regions approach results in an $R^2$ of
0.9992, where $R^2$ is the squared correlation coefficient between observed and fitted values, with
the following prediction model:

income/capital =
    686734.967
    + 315.724 * (AGRI < 25.6)
    + 10.062 * (AGRI < 25.6)^2
    - 0.41441 * (AGRI < 25.6)^3
    + 53667.040 * (AGRI >= 25.6)
    - 3053.65 * (AGRI >= 25.6)^2
    + 39.87 * (AGRI >= 25.6)^3
    - 17381.39 * (INDUS < 31.58)
    - 16136.22 * (INDUS >= 31.58 and < 43.45)
    - 42322.28 * (INDUS >= 43.45)
    + 850.87 * (INDUS >= 43.45)^2
    - 5.62 * (INDUS >= 43.45)^3
    - 1214745.52 * (SERVICE < 22.60)
    + 77755.95 * (SERVICE < 22.60)^2
    - 938.51 * (SERVICE < 22.60)^3
    + 620.29 * (SERVICE >= 31.15 and < 39.69)
    - 45.80 * (SERVICE >= 31.15 and < 39.69)^2
    + 0.82 * (SERVICE >= 31.15 and < 39.69)^3
+ 0.04 * (SERVICE >= 39.69)' 

(sample 4) 

The comparison between (sample 3) and (sample 4) shows that the dynamic solution uses only
two regions for the variable AGRI versus 3 regions in the fixed region approach. Any
reduction in the number of independent variables (directly related to a reduction of the number of
sub-ranges) means an improvement with respect to computational efficiency, that is, processing
time.


1.7 Extension by Using General Regression Functions Instead of Regression Polynomials

In a further embodiment of the current invention its teaching may be readily extended to the
calculation of a prediction model based on a multitude of regression functions as a generalization
of regression polynomials. The only difference is that within step 106, for every independent
local-data-mining-variable (relating to a certain sub-range) not yet reflected in said current
prediction-model, a multitude of partial regression functions are calculated. A partial
regression-function within this context means a regression function (not necessarily a
polynomial) which depends on one of the local-data-mining-variables only. The rest of the steps
of the above described methodology, like the step of determining a significance value for a
partial regression function, the step of selecting the most significant partial regression function
and the step of adding the most significant partial regression function to the prediction model, can
be performed in exact correspondence to the above regression polynomial based teaching.

The current invention proposes a data mining method for determining a prediction model that
allows one to determine the actual polynomial degree of the unknown functional relationship of
the dependent and the independent variables. The method can therefore be exploited at the same
time to generate prediction models for linear and non-linear functional relationships. It is no
longer necessary to assume a functional relationship of a common predefined polynomial degree m
for all independent variables. As the proposed method is able to determine the actual functional
relationship, the prediction quality is improved at the same time. This advantage will even
increase for values of the independent variable residing not in the close neighborhood of the
training data areas.

Being able to determine the actual functional relationship between the dependent and the
independent variable is of special importance in the area of data mining, as it is its imperative
target to show a user the functional dependencies in analytical terms.

Prediction models determined according to the current invention moreover are characterized by 
increased computational efficiency in terms of processing time and memory space (compared to 
prediction models with a comparable number of independent variables) as it avoids the need to 
compute polynomials for the independent variables with an assumed maximum polynomial 
degree m even if the functional relationship is of simpler form (for instance linear). 

Finally the Stepwise Polynomial Regression is a highly scalable, high performance methodology 
for the determination of prediction models. It clearly outperforms other methodologies currently 
viewed as providing the best state of the art scalability behavior. 

Further embodiments of the invention allow prediction models to be determined with a reduced
number of independent variables (compared to prediction models wherein the number of
independent variables has to be predefined and is not self-adjusted as in the current case), which
at the same time improve the quality of the prediction models, as only those independent
variables which improve the prediction quality will become part of the prediction model. This
property is an important computational advantage within data mining, as the frequency with
which a prediction model is (re-)computed is low compared to the frequency with which a given
prediction model is used to compute the corresponding dependent variable. Finally a reduced
number of independent variables provides exploiters of data mining technology with a deeper
insight into the unknown fundamental functional relationship between the variables. Once this
knowledge is available it can be utilized for the provision of new data to be analyzed by data
mining techniques: the scope of mining data to be sampled can be limited to those data records
comprising the most relevant independent variables only. This leads to cost reduction when
applying data mining techniques.

Moreover the feature of the invention allowing the determination of "prediction models with a
reduced number of independent variables" can be expressed in other words as the capability to
automatically determine from a large set of independent variables that subset which actually
influences the dependent variable. This capability addresses a fundamental data mining problem:
the typical starting point for the application of data mining techniques is a huge data asset with a
lot of different variables for which nobody knows which of the variables are functionally
influencing a certain dependent variable. Current state of the art technology requires reliance on
human "intuition" or human knowledge for selecting the (hopefully) "correct" independent
variables to be included in the prediction model. Furthermore, once a certain set of independent
variables has been selected as the basis for the prediction model, one could never be sure of the
correctness of the selection; i.e. important variables could have been ignored by the human expert,
or independent variables with only a minor influence on the dependent variable could have been
added, which unnecessarily increases the computational complexity of the prediction
model. The teaching, in contrast, provides a method for automatically determining
from the huge set of potential variables those with the most significant functional influence on
the dependent variable. The prediction quality of a prediction model according to the proposed
technology is therefore improved in two directions: first by determining regression polynomials
with a polynomial degree adapted to the actual functional dependency; second by determining
those independent variables with the most significant influence on the dependent variable. For
experienced exploiters of data mining technology the proposed teaching helps to reduce the effort
to identify those portions of the huge data assets that are of relevance for the prediction models to
be computed. For other addressees, not having knowledge of the functional dependencies of
variables, the current invention may represent the enabling technology for exploitation of data
mining techniques altogether.